-
Notifications
You must be signed in to change notification settings - Fork 288
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update the evaluation method to pass the dataset item to the scoring method #698
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regarding the tests:
- Please update
unit.evaluation.test_evaluate.test_evaluate_happyflow
so that task outputs don't duplicate dataset items content. Also updatee2e.test_experiment.py
- Add a new test similar to
unit.evaluation.test_evaluate.test_evaluate_happyflow
, but which will use mapping functionality.
It should be enough.
@@ -21,6 +21,9 @@ def evaluate( | |||
nb_samples: Optional[int] = None, | |||
task_threads: int = 16, | |||
prompt: Optional[Prompt] = None, | |||
scoring_key_mapping: Optional[ | |||
Dict[str, Union[str, Callable[[dataset_item.DatasetItem], Any]]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dataset_item.DatasetItem
is not a part of our public API, users manipulate with dictionaries.
So, it's better to allow them to specify callables that take dataset item dictionary as input.
We can also make type hints more verbose with aliases:
DatasetItemDict = Dict[str, Any]
...
scoring_key_mapping: Optional[
Dict[str, Union[str, Callable[[DatasetItemDict], Any]]]
]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
UPD. I ran the code, looks like it works with dictionaries as it should, so probably just the type hint is wrong.
@@ -56,12 +56,32 @@ def _score_test_case( | |||
return test_result_ | |||
|
|||
|
|||
def _create_scoring_inputs( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's move this to metrics.arguments_helpers.create_score_inputs
and then rename item
to dataset_item
.
A few unit tests would be nice specifically for this function, it's a very important and sensitive piece of logic.
) -> Dict[str, Any]: | ||
mapped_inputs = {**item, **task_output} | ||
|
||
if scoring_key_mapping is not None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Invert if
condition with a "guard clause" and the nesting will be reduced
if scoring_key_mapping is None:
return mapped_inputs
for k, v in scoring_key_mapping.items():
if callable(v):
mapped_inputs[k] = v(item)
else:
mapped_inputs[k] = mapped_inputs[v]
return mapped_inputs
…method (#698) * Updated evaluate function
Details
Update the evaluation method to pass the dataset item to the scoring method